Efficiencies for Working with IPEDS Peer Data

Alex McClung

Data Resource Analyst

April 3, 2023

Outline

  1. Automating & documenting custom IPEDS Data Center downloads

  2. Working with multiple years of IPEDS “Complete” data files

Motivation

Peer data needed for 14 Yale school profiles

️️

IPEDS Data Center Downloads

Browser automation with Playwright Test for Python 🎭

Playwright Test for Python 🎭

Save, modify, and run the codegen recorded code

from playwright.sync_api import Playwright, sync_playwright, expect

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False, slow_mo=50)
    ### RECORD VIDEO
    context = browser.new_context(record_video_dir="videos/")
    page = context.new_page()
    page.goto("https://nces.ed.gov/ipeds/use-the-data")
    page.get_by_role("link", name="Compare Institutions", exact=True).click()
    ### ADD/CHANGE PEER SETS
    page.locator("#tbInstitutionSearch").fill("217156,190150,190415,182670,166027,166683,186131,243744,144050,215062,130794")
    page.get_by_role("button", name="Select").click()
    page.get_by_text("Check All", exact=True).click()
    page.get_by_role("button", name="Continue").click()
    page.get_by_role("link", name="Continue").click()
    page.get_by_text("Admissions and Test Scores").click()
    page.get_by_text("Admissions and test scores", exact=True).click()
    page.get_by_text("Number of applications, admissions, and enrollees").click()
    ### ADD/CHANGE YEARS
    page.get_by_label("2021-22").check()
    page.get_by_label("Applicants total").check()
    page.get_by_label("Admissions total").check()
    page.get_by_label("Enrolled total").check()
    ### SAVE SCREENSHOTS
    page.screenshot(path="admissions-screenshot.png")
    page.get_by_role("img", name="Continue").click()
    page.get_by_role("img", name="Continue").click()
    with page.expect_download() as download_info:
        page.get_by_role("button", name="Continue").click()
    download = download_info.value
    ### EXPORT DATA AS CSV
    download.save_as("ivyplus-admissions.csv")

    # ---------------------
    context.close()
    browser.close()

with sync_playwright() as playwright:
    run(playwright)

IPEDS Complete Data Files

Working with multiple years of data with Arrow

  • Read multiple CSVs at once (or, as one)

  • Query data before reading it into memory

  • Add year of the filename to the data set

Arrow Example in R

Reading multiple CSVs with arrow::open_dataset()

library(arrow)
## EXAMPLE IPEDS DATA FILES (~60 Mb each)  
dir('data')
[1] "c2017_a.csv" "c2018_a.csv" "c2019_a.csv" "c2020_a.csv" "c2021_a.csv"
tic()
ds <- open_dataset('data', format = 'csv')
toc()
0.13 sec elapsed
## WHAT KIND OF OBJECT? OBJECT SIZE? 
class(ds); print(object.size(ds), units = "auto")
[1] "FileSystemDataset" "Dataset"           "ArrowObject"      
[4] "R6"               
504 bytes

Arrow Example in R

Query data before reading it into memory

ds <- open_dataset('data', format = 'csv') %>%
  ## YOUR QUERY HERE  
  filter(UNITID == 130794) %>% 
  group_by(AWLEVEL) %>% 
  summarize(Degrees = sum(CTOTALT, na.rm = TRUE))

## WHAT KIND OF OBJECT? OBJECT SIZE? 
class(ds); print(object.size(ds), units = "auto")
[1] "arrow_dplyr_query"
8.2 Kb

Arrow Example in R

Get year with add_filename() and collect() data into memory

open_dataset('data', format = 'csv') %>% 
  filter(UNITID == 130794, MAJORNUM == 1, CIPCODE == 99) %>% 
  mutate(file_name = add_filename(),
         Year = gsub('.*c([0-9]+)_a.csv', '\\1', file_name)) %>% 
  group_by(Year) %>% 
  summarise(Degrees = sum(CTOTALT, na.rm = TRUE)) %>% 
  collect() ## INTO MEMORY
# A tibble: 5 × 2
  Year  Degrees
  <chr>   <int>
1 2018     4494
2 2017     4345
3 2019     4789
4 2020     4863
5 2021     4121

Additional Uses for Arrow

  • You do not need to use it with multiple files
## READ AND QUERY BEFORE COLLECTING INTO MEMORY  
tic()
degrees_fy21 <- read_csv_arrow('data/c2021_a.csv', as_data_frame = FALSE)
toc()
0.54 sec elapsed
  • Read/write parquet data files, considered the default format ( or “the new CSV” 😎 ) in cloud computing
arrow::open_dataset()
arrow::write_parquet()
arrow::read_parquet()

Thank You!